Protein features fusion using attributed network embedding for predicting protein-protein interaction

Background Protein-protein interactions (PPIs) hold significant importance in biology, with precise PPI prediction as a pivotal factor in comprehending cellular processes and facilitating drug design. However, experimental determination of PPIs is laborious, time-consuming, and often constrained by technical limitations. Methods We introduce a new node representation method based on initial information fusion, called FFANE, which amalgamates PPI networks and protein sequence data to enhance the precision of PPIs’ prediction. A Gaussian kernel similarity matrix is initially established by leveraging protein structural resemblances. Concurrently, protein sequence similarities are gauged using the Levenshtein distance, enabling the capture of diverse protein attributes. Subsequently, to construct an initial information matrix, these two feature matrices are merged by employing weighted fusion to achieve an organic amalgamation of structural and sequence details. To gain a more profound understanding of the amalgamated features, a Stacked Autoencoder (SAE) is employed for encoding learning, thereby yielding more representative feature representations. Ultimately, classification models are trained to predict PPIs by using the well-learned fusion feature. Results When employing 5-fold cross-validation experiments on SVM, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% in terms of Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. Conclusion Experimental findings across various authentic datasets validate the efficacy and superiority of this fusion feature representation approach, underscoring its potential value in bioinformatics.


Background
The principles of protein-protein interactions (PPIs) involve various aspects such as physical and chemical interactions, molecular recognition mechanisms, and dynamic regulation in living organisms [1]. PPIs are crucial for various biological processes and can be categorized as permanent or transient interactions. Permanent interactions form stable complexes, while transient interactions are dynamic and reversible [2,3]. Proteins have specific recognition motifs that allow them to interact selectively with their target proteins [4]. Understanding PPIs is vital for unraveling biological processes, identifying therapeutic targets, and developing drugs to modulate specific interactions [5,6].
Performing biological experiments for detecting PPIs is the most common way to observe how they function. With the development of biological techniques, more PPI data have been collected from high-throughput experiments such as protein chips, yeast two-hybrid (Y2H) systems, mass spectrometry protein complex identification (MS-PCI), and others [4,7,8,9]. Nevertheless, these experimental methods are costly, labor-intensive, and slow [10].
Proteins within cells form complex signaling networks through interactions, which govern crucial aspects such as the cell's lifecycle, metabolic pathways, and signal transduction [11]. Thanks to advancements in high-throughput experimental methods, such as mass spectrometry analysis and protein interactomics, it has become easier to access a large amount of PPI data [12]. These cutting-edge technologies have facilitated the accumulation of extensive PPI data, which serves as the foundation for predictive research. By integrating and analyzing this wealth of data, we can construct comprehensive protein-protein interaction networks that enable us to gain deeper insights into the essence of protein function and cellular processes [13]. Moreover, these PPI datasets not only provide valuable resources for experimental validation but also serve as crucial training and evaluation benchmarks for the development of prediction models and algorithms [14].
Recently, numerous computational methods have been developed to predict protein-protein interactions (PPIs), which play a crucial role in understanding biological processes and diseases [15][16][17][18][19][20][21]. These methods aim to generate prediction results with high confidence, facilitating further research on PPIs. For instance, Wang et al. (2019) proposed a deep learning-based method achieving a high accuracy of 97.31% on a human-related dataset [16]. Computational approaches, such as deep learning and graph-based representations, learn patterns from existing data to predict interactions accurately, thus improving the efficiency and precision of biological experiments. Jha et al. integrated protein sequence-derived features with graph-based representations using Graph-BERT encoding, while Huang et al. introduced SGPPI, a structure-based deep learning framework leveraging AlphaFold2's monomer structures and graph convolutional networks [21]. TAGPPI is another novel framework utilizing protein sequence data alone, outperforming existing methods and marking the first utilization of predicted protein topology structure graphs for sequence-based PPI prediction [22]. Additionally, PASNVGA utilizes a variational graph autoencoder to integrate sequence and network information, demonstrating superior performance across multiple datasets [23]. DensePPI, proposed by Halsana et al., utilizes a deep convolutional strategy to predict PPIs with high accuracy across diverse organism datasets [24].
Furthermore, protein language models, such as ESM-2 and AlphaFold2, represent a significant advancement in computational biology [25][26][27]. These models leverage deep learning techniques to predict protein structures directly from primary sequences. ESM-2, a transformer-based protein language model trained on a vast amount of protein sequence data, infers protein structures with remarkable accuracy. Similarly, AlphaFold2 excels in predicting structures from multiple sequence alignments, showcasing the potential of language models to generate accurate structure predictions.
In this research, we introduce an initial information fusion-based node representation method for protein feature representation that uses sequence and interaction network profiles. Specifically, we use a Gaussian kernel-based similarity metric and the Levenshtein distance metric to capture the protein interaction profile and protein sequence information, respectively. To obtain an initial information matrix, a weighted feature fusion technique balances the two types of information with a weighting parameter. Subsequently, we train a Stacked Autoencoder (SAE) model on the initial information fusion matrix to represent the features of proteins. Finally, an SVM classifier is employed for downstream prediction tasks. To thoroughly assess the performance of our method, we conducted experiments on three commonly used datasets by utilizing a 5-fold cross-validation strategy as used in [28][29][30][31][32]. Notably, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% on the Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. Performance comparisons with previous models demonstrated the effectiveness of this approach.

Results
In this study, we propose to employ a feature fusion method for feature learning and a binary classifier for predicting PPIs. Figure 1 shows the overall procedure for the methodology proposed in this research.
This methodology provides a systematic approach for protein-protein interaction prediction, involving data preparation, feature fusion, node embedding, classification model selection and training, and performance evaluation. It offers a framework for accurately predicting protein interactions, thereby contributing to the understanding of biological processes.
In addition, this study utilizes different values of the hyper-parameter alpha for feature fusion learning to obtain new features. The effectiveness of these features is then examined using SVM as a classifier, comparing accuracies to select the optimal parameter settings. These features are considered optimal numerical representations of protein node characteristics, suitable for subsequent classification tasks. By training more complex and robust classifiers, improved classification performance can be achieved.

Fig. 1 Schematic representation of the proposed methodology
To evaluate model performance, we use widely adopted, standardized metrics: accuracy (Acc.), precision (Prec.), sensitivity (Sen.), F1 score, Matthews correlation coefficient (MCC), the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). Accuracy measures the proportion of correctly classified instances, precision is the fraction of predicted positives that are truly positive, sensitivity (recall) indicates the model's ability to correctly identify positive instances, and the F1 score is the harmonic mean of precision and sensitivity. The MCC considers all four entries of the confusion matrix and offers a balanced measure even when classes are of different sizes. Additionally, the ROC curve illustrates the performance of a binary classifier at various threshold settings, with AUC summarizing the overall classifier performance. Specifically, an AUC of 1 represents a perfect classifier that ranks all positive instances higher than negative ones, while an AUC of 0.5 suggests a classifier performing no better than random chance.
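As a minimal sketch of how the threshold-based metrics above follow from the four confusion-matrix counts, the function below computes Acc., Prec., Sen., F1, and MCC. The function name and the example counts are hypothetical and chosen only for illustration; they are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute Acc., Prec., Sen., F1, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity (recall).
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    # MCC uses all four counts; it is set to 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": accuracy, "prec": precision, "sen": sensitivity,
            "f1": f1, "mcc": mcc}


# Hypothetical counts for illustration only (not results from the paper).
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

In a 5-fold cross-validation setting, such a function would be applied to each fold's predictions and the per-fold values averaged.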

Parameter selection of FFANE
In our proposed method for feature fusion learning, there is one parameter for balancing the weight between PPI network information and protein sequence information. From the definition of formula (3), parameter α ranges from 0 to 1. When the parameter α is set to 0.5, it signifies an equal weighting of the two types of information in the features fusion matrix. When α is set to 0, it implies that the features fusion matrix contains only sequence information. Conversely, when α is set to 1, it indicates that the features fusion matrix exclusively comprises network information.
Here, a grid search approach is employed to obtain the best parameter α. The parameter α is set to values ranging from 0 to 1, with intervals of 0.125. Upon establishing the parameter α configurations, we proceeded to train the SAE model to learn features corresponding to protein nodes' features fusion matrix. These extracted features were subsequently partitioned via a five-fold cross-validation methodology. The SVM classifier was employed for the downstream classification task.
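The grid search described above can be sketched as a small harness. This is only an illustration under stated assumptions: `evaluate_fold` is a hypothetical callback standing in for training and testing the classifier on the features learned for a given α, and the folds here are contiguous and unshuffled for simplicity, which may differ from the partitioning actually used.

```python
def alpha_grid(step=0.125):
    """Return the candidate alpha values 0, step, ..., 1."""
    n_steps = round(1 / step)
    return [i * step for i in range(n_steps + 1)]


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index lists."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def select_alpha(evaluate_fold, n_samples, k=5):
    """Return (best_alpha, mean_accuracy) over the alpha grid.

    evaluate_fold(alpha, train_idx, test_idx) is a hypothetical callback that
    trains and tests a classifier on the features learned for this alpha and
    returns the test accuracy of that fold.
    """
    best = None
    for alpha in alpha_grid():
        scores = [evaluate_fold(alpha, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_acc = sum(scores) / len(scores)
        if best is None or mean_acc > best[1]:
            best = (alpha, mean_acc)
    return best
```

With a step of 0.125 the grid contains nine candidate values, including the two single-source extremes α = 0 and α = 1.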
Specifically, an in-depth analysis of the outcomes presented in Table 1 for the S. cerevisiae Dataset reveals a noteworthy pattern. The highest average accuracy, 94.28% with a standard deviation of 0.65%, is obtained when the parameter α is 0.375; note that the weight allocated to sequence information is then 0.625. Correspondingly, the ROC curves are plotted in Fig. 2, in which the AUCs are close to 1, indicating strong performance. Significantly, when α is set to 0, denoting the exclusion of PPI interaction information in favor of sole reliance on sequence data, the average accuracy drops to 88.63%. Conversely, when α equals 1, the average accuracy reaches 93.37%.
When employing the proposed method on the H. sapiens Dataset, as listed in Table 2, the overall average accuracy consistently exceeds 97%, with the highest average accuracy obtained when α is 0.625. Correspondingly, the ROC curves are plotted in Fig. 3, in which the AUC is close to 1, indicating near-perfect performance. Investigations concerning the H. pylori Dataset, as detailed in Table 3, show a peak average accuracy of 84.05% when α is set to 0.75, whereas α values of 0 or 1 yield average accuracies below 82.58%. Correspondingly, the ROC curves are plotted in Fig. 4, in which the average AUC is 0.9179, indicating effective performance.
From the above results, it is evident that FFANE exhibits stronger predictive performance when the alpha parameter is neither 0 nor 1, indicating that the fusion of information outperforms single-source features.

Prediction performance among different classifiers
In this section, some classic classifiers are trained, including XGBoost (XGB), Random Forest (RF), and Naïve Bayes (NB). For the S. cerevisiae Dataset, H. sapiens Dataset, and H. pylori Dataset, the alpha parameter in FFANE was set to 0.375, 0.625, and 0.75, respectively.
Tables 4, 5, and 6 present the experimental results of our feature fusion method combined with various classifiers on the three datasets. The experimental outcomes illustrate that the accuracy of the feature fusion method combined with the XGB classifier surpasses that of the other three approaches.
In Table 4, for the S. cerevisiae dataset, the use of the XGB classifier resulted in a 5.07% accuracy improvement over the SVM classifier, a 97.79% improvement over the RF classifier and an 11.92% improvement over the NB classifier. The corresponding ROC curves of XGB, NB, and RF are plotted in Figs. 5, 6, and 7, respectively. In Table 5, for the H. sapiens dataset, FFANE-XGB outperforms FFANE-SVM, FFANE-RF and FFANE-NB by 5.08%, 9.8% and 11.92% in accuracy. The corresponding ROC curves of XGB, NB, and RF are plotted in Figs. 8, 9, and 10, respectively.
In Table 6, a similar trend is observed when applying these methods to the H. pylori dataset, where the XGB classifier demonstrates a significant increase in accuracy compared to the other three classifiers. This improvement may be attributed to the XGBoost classifier being more advanced than the SVM, RF, and NB classifiers. It highlights the prospect of achieving superior results by integrating our feature fusion technique with the latest advancements in classification methods.

Comparison with state-of-the-art prediction methods
In this section, we compare our proposed method with existing methods that use different types of fusion approaches, all evaluated by 5-fold cross-validation (see Table 7).
Some methods use a single kind of feature extraction. Li et al. proposed applying the Scale-Invariant Feature Transform (SIFT) algorithm to the Position Weight Matrix (PWM) derived from protein sequences [28]. The Position-Specific Scoring Matrix (PSSM), obtained by transforming protein sequences using PSI-BLAST, is widely employed to extract sequence features. The original matrix cannot be used directly as a feature vector for classifier training. To extract features, Li et al. proposed to use the Orthogonal Locality Preserving Projections (OLPP) algorithm, which aims to preserve local structure and discriminative information while reducing dimensionality, resulting in fixed-length feature vectors that represent each protein [29].
Some methods use two or more kinds of feature extraction. An et al. proposed PSSM-SVM, which fuses two kinds of features obtained via Bigram Probability (BP) and Local Average Group (LAG) on PSSM [33]. The AE-SVM model combines an autoencoder (AE) with SVM; it leverages sequence information using the CT and CTD feature extraction methods [34], and the AE reduces the dimensionality of the features. The functional-link Siamese neural network (FSNN-SVM) uses the fusion of features derived using pseudo amino acid composition and conjoint triad descriptors [30]. The FSNN extracts high-level abstraction features from the raw features, and SVM performs the PPI prediction task using these abstraction features. Wang et al. proposed a novel deep learning algorithm called symmetric nonnegative latent factorization (SNLF) [31]. The method enhances the quality of PPI data using SNLF and encodes proteins using Quasi-Sequence-Order based on their sequence information. Principal component analysis is utilized for compact feature generation, and a graph variational AE learns protein embeddings considering features and network topology. The embeddings are then fed into a feedforward neural network for PPI prediction. StackPPI utilizes 6 kinds of features and applies XGBoost for feature noise reduction and dimensionality reduction [32]. The optimized features are then analyzed using a stacked ensemble classifier consisting of random forest, extremely randomized trees, and logistic regression algorithms.
In Table 7, it is evident that when our method is applied to the S. cerevisiae dataset and the H. sapiens dataset, the accuracy of the proposed method (SVM) surpasses that of other existing methods, reaching 94.19% and 97.69%, respectively.This indicates a marked improvement in performance following feature fusion.However, when the proposed method (SVM) is applied to the H. pylori dataset, the accuracy drops to 84.05%, slightly lower than the highest accuracy of 88.47% achieved by the FSNN-SVM method.This discrepancy may be attributed to the relatively small size of the H. pylori dataset (only 2916 protein interactions), which is prone to overfitting when working with limited protein interaction data.In contrast, the other two datasets are larger, allowing our method to deliver more favorable outcomes.Consequently, our approach is better suited for larger datasets, aligning with the inevitable trend of growing protein interaction datasets as our understanding of protein interactions continues to expand.Additionally, our proposed method (XGB) outperforms proposed method (SVM) across all

Conclusion
In this research, we introduced a novel approach called FFANE that leverages feature fusion in SAE for protein feature extraction. Following an exhaustive grid search over the weighting coefficient for the two types of information, we obtained multiple sets of feature vectors. Subsequently, we trained SVM to test the accuracy and selected the optimal alpha value. At the optimal alpha value, FFANE's node representation can be considered to express node features accurately. Moreover, we replaced the classifier with a more robust one, which typically requires longer training time than SVM but exhibits stronger classification capabilities. The effectiveness of our proposed method is validated from several perspectives. Three classical datasets were used. By tuning the parameter alpha, which sets the proportion between the PPI profile and the sequence profile, from zero to one, the best value of alpha was selected. Notably, setting alpha to zero or one does not yield the highest prediction accuracy. Compared with state-of-the-art methods, the performance of our proposed method shows that it is promising for PPI prediction.
Besides, most state-of-the-art methods are dominated by deep learning models, with protein language models such as AlphaFold and ESM-2 showing tremendous potential. However, it is worth noting that deep learning models often require powerful computational resources (such as CUDA core computing capability) and considerable effort for model debugging and training. In contrast, the FFANE algorithm has modest hardware requirements, offering greater flexibility and lower time costs. When incorporating new protein profiles, we can explore fusion learning, conduct testing and validation using SVM, and compare the results with the SVM-based benchmarks reported in state-of-the-art works to assess effectiveness.
In future work, several improvements to our proposed method are possible. Firstly, novel feature representation methods could be introduced, as a more precise numerical representation of protein profiles is crucial for minimizing noise and constructing an overall robust model. Secondly, there is room for improvement in the fusion methods employed for different features. Thirdly, with the enhancement of hardware computational capabilities and the reduction in computation costs, it becomes feasible to train more complex and powerful neural networks for deeper feature learning, including protein language models.

Methods
We developed a computational approach called FFANE to extract protein features. The proposed method integrates the Gaussian kernel similarity matrix and Levenshtein distance-based protein sequence similarities through weighted fusion, followed by Stacked Autoencoder (SAE) encoding learning, ultimately enabling accurate prediction of protein-protein interactions using machine-learning methodologies.

Datasets
Three distinct datasets were selected for analysis: the Saccharomyces cerevisiae (S. cerevisiae) dataset, the Homo sapiens (H. sapiens) dataset, and the Helicobacter pylori (H. pylori) dataset. The details of the datasets are listed in Table 11.
The S. cerevisiae dataset was curated from the core subset of interacting proteins sourced from the Database of Interacting Proteins (DIP) at https://dip.doe-mbi.ucla.edu/dip [35,36]. Most protein pairs we collected exhibited pairwise sequence identities below the 40% threshold upon sequence alignment. In total, 5,594 positive interaction pairs were obtained. Using sub-cellular localization information, 5,594 negative pairs were constructed, in accordance with the work in [35].
The H. sapiens dataset originated from the Human Protein References Database at https://hprd.org [37]. The PPI dataset comprises 8161 empirically validated PPIs spanning 2835 distinct human proteins. Rigorous data curation identified 3899 unique positive PPIs and 4262 negative PPIs after excluding self-interactions and duplicate instances.
The H. pylori dataset sought to unravel the molecular intricacies underlying the bacterium's survival strategies and pathogenic tendencies [38]. It comprises 808 distinct H. pylori proteins, with 1,458 positive and 1,458 negative pairs. These interactions were categorized into distinct classes according to the experimental evidence supporting each, including physical association, co-expression, and co-localization. The processed dataset can be downloaded at https://github.com/YuBinLab-QUST/EResCNN/tree/main/Dataset.

Construction of protein similarity
Within the framework of our proposed methodology (see phase 1 in Fig. 1), we combine protein sequence details with interaction data and then use an SAE for feature encoding and learning. Before fusion, each information source is processed with a technique suited to its characteristics: a Gaussian kernel-based similarity metric for protein interaction data and the Levenshtein distance metric for sequence information. The Gaussian kernel is widely used in many fields for its efficiency in extracting useful information from its input. Let $G = (V, E)$ denote the PPI network, where the vertices $V$ represent proteins and the edges $E$ represent the interactions between them. Given the adjacency matrix $A \in \mathbb{R}^{n_{protein} \times n_{protein}}$ of the PPI network with $n_{protein}$ proteins, the Gaussian kernel-based similarity between the $i$-th protein $p(i)$ and the $j$-th protein $p(j)$ is calculated as follows:

$$S_{network}(p(i), p(j)) = \exp\left(-\gamma_r \left\| A(p(i)) - A(p(j)) \right\|^2\right) \quad (1)$$

where $A(p(i))$ denotes the row of $A$ corresponding to protein $p(i)$, i.e. its interaction profile, and $\gamma_r$ denotes the Gaussian kernel bandwidth, whose definition is given in Eq. (2).

To construct the sequence-based similarity, the Levenshtein distance metric was employed. The core idea of this algorithm is to measure the similarity between two sequences by the minimum number of edit steps (insertions, deletions, and substitutions) needed to make the sequences identical [39,40]. Here, a standard Python package, Biopython, is used to compute the similarity between proteins [41]. The latest release is Biopython 1.79, released on 3 June 2021 (https://biopython.org). Biopython offers a series of bioinformatics analysis tools, including reverse complementation of DNA strings, searching for motifs in protein sequences, and others. Finally, a protein similarity matrix $S_{seq}$ can be obtained.
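The two similarity measures described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this study: the bandwidth is set to the reciprocal of the mean squared norm of the interaction profiles, a common choice for Gaussian interaction profile kernels that is assumed here; the conversion of the Levenshtein distance into a similarity in [0, 1] by dividing by the longer sequence length is likewise an assumption; and the function names are hypothetical. The edit distance is written out directly rather than calling a library.

```python
import math


def gaussian_kernel_similarity(adjacency):
    """Gaussian kernel similarity between rows (interaction profiles) of A."""
    n = len(adjacency)
    mean_sq_norm = sum(sum(v * v for v in row) for row in adjacency) / n
    # Assumed bandwidth: reciprocal of the mean squared profile norm.
    gamma = 1.0 / mean_sq_norm if mean_sq_norm else 1.0
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(adjacency[i], adjacency[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions a -> b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_similarity(a, b):
    """Map the edit distance to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Applying `sequence_similarity` to every pair of protein sequences would give a matrix playing the role of the sequence similarity matrix, while `gaussian_kernel_similarity` applied to the adjacency matrix would give the network counterpart.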

Feature fusion matrix
Using a single feature type cannot reveal the potential mechanism in more depth. Therefore, it is a challenging task to improve efficiency by merging different types of features. Here, we propose fusing the structural and attributed information derived from the proteins' interaction profile and sequence profile. The features fusion matrices are computed and merged using the weighting method.
Given a Gaussian kernel-based network similarity matrix $S_{network} \in \mathbb{R}^{n_{protein} \times n_{protein}}$ and a Levenshtein distance-based sequence similarity matrix $S_{seq} \in \mathbb{R}^{n_{protein} \times n_{protein}}$, the fusion matrix is defined as follows:

$$S_{fusion} = \alpha S_{network} + (1 - \alpha) S_{seq} \quad (3)$$

where each element in the matrix represents the proximity of transition from one protein to the others, so the matrix is also called a proximity matrix. Note that the parameter $\alpha$ ranges from 0 to 1. Previous work on the Katz index focuses on merging multiple proximity matrices of different orders, and a growing number of network embedding or node embedding methods, such as node2vec, DeepWalk, and LINE, have been developed to learn node features from structural information [42,43]. Unlike these existing works, which use only the limited interaction profile, our proposed method for fusing proximity matrices integrates two kinds of protein information: the sequence profile and the interaction profile. Such a proximity matrix contains rich node information that can be utilized in protein feature representation [44].
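The weighted fusion can be illustrated with a short sketch. It assumes the convention stated in the parameter selection section, namely that α = 1 keeps only network information and α = 0 keeps only sequence information; the function name and the toy matrices are hypothetical.

```python
def fuse_similarity(s_network, s_seq, alpha):
    """Weighted fusion: alpha * S_network + (1 - alpha) * S_seq."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(s_network) != len(s_seq):
        raise ValueError("matrices must have the same shape")
    return [
        [alpha * net + (1.0 - alpha) * seq
         for net, seq in zip(net_row, seq_row)]
        for net_row, seq_row in zip(s_network, s_seq)
    ]


# Toy 2x2 similarity matrices (made-up values, for illustration only).
s_network = [[1.0, 0.2], [0.2, 1.0]]
s_seq = [[1.0, 0.6], [0.6, 1.0]]
fused = fuse_similarity(s_network, s_seq, alpha=0.5)
```

With α = 0.5 each fused entry is the plain average of the two similarities, which matches the equal-weighting case described for the grid search.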

Stacked autoencoder for node embedding
The constructed fusion matrix combines node attributes with structural information and is also called the initial information fusion matrix. Notably, the dimension of the initial information fusion matrix is N, where N represents the number of proteins, while the constructed feature vector is 2*N. Excessively high dimensions pose a serious challenge to model training, often resulting in prolonged training times or even training failures. Furthermore, such a matrix is informative but inefficient for model training, and it still needs to be refined for downstream learning tasks. SAE, a non-linear dimensionality reduction technique, is widely used for feature learning of nodes with raw features. It generates node embeddings by mapping the raw sequence or coding into a new feature space with lower dimensions but higher efficiency. The definition of SAE is as follows [45]: SAE is a deep learning model that constructs a deep neural network by stacking multiple hidden layers, leveraging the concept of the autoencoder (AE). Each hidden layer focuses on learning different levels of abstract features from the data, progressively enhancing the representation capability of the features. A basic AE, illustrated in Fig. 14(A), can be defined in two parts: an encoder and a decoder. Given an original input $x \in \mathbb{R}^{n}$, the goal of the encoder is to map $x$ into an encoding feature $h \in \mathbb{R}^{d}$ by using a transformation matrix $W_{encoder} \in \mathbb{R}^{d \times n}$, where $d$ denotes the number of neurons in the hidden layer. Then, the goal of the decoder is to obtain the reconstructed feature $\tilde{x}$ by applying a transformation matrix $W_{decoder} \in \mathbb{R}^{n \times d}$ to $h$. The definitions of the encoder and decoder are as follows:

$$h = f(x) = \sigma(W_{encoder}\, x + b_{encoder})$$

$$\tilde{x} = g(h) = \sigma(W_{decoder}\, h + b_{decoder})$$

where $b_{encoder}$ and $b_{decoder}$ are the bias parameters of the encoder and decoder, and $\sigma(\cdot)$ is the activation function. SAE learns the nodes' features without the corresponding labels: the parameters $W_{encoder}$, $W_{decoder}$, $b_{encoder}$, and $b_{decoder}$ are optimized by minimizing the reconstruction error between input and output via a loss function and a gradient descent algorithm. The loss function can be defined as follows:

$$F_{loss} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - \tilde{x}_i \right\|^2$$

where $N$ denotes the number of samples. Further, $F_{loss}$ can be formulated by the encoder $f$ and decoder $g$ as:

$$F_{loss} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - g(f(x_i)) \right\|^2$$

In this study, we investigated an SAE comprising two hidden layers. The architecture of the SAE is depicted in Fig. 14(B). In our feature learning setup, the SAE utilized lacks the decoder component, employing only the encoder for feature reconstruction learning and dimensionality reduction. Specifically, the two hidden layers enhance features progressively, ultimately leading to the output at the output layer. The layer sizes are as follows: N (input layer), 1024 (hidden layer), 512 (hidden layer), 128 (output layer).
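To make the encoder, decoder, and reconstruction loss of a basic AE concrete, the following sketch evaluates one forward pass in plain Python. It is an illustration under assumptions: the sigmoid is used as the activation function, the loss is the squared reconstruction error averaged over samples, and the helper names are hypothetical. It does not reproduce the TensorFlow model or the layer sizes used in this study.

```python
import math


def sigmoid(z):
    """Logistic activation, used here as an assumed choice of sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, x):
    """Apply sigma(W x + b) for one layer; weights is a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def reconstruction_loss(inputs, w_enc, b_enc, w_dec, b_dec):
    """Squared reconstruction error averaged over all input vectors."""
    total = 0.0
    for x in inputs:
        h = dense(w_enc, b_enc, x)      # encoder: R^n -> R^d
        x_rec = dense(w_dec, b_dec, h)  # decoder: R^d -> R^n
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total / len(inputs)
```

Training would adjust the four parameter sets by gradient descent to reduce this value; stacking several such encoder layers and keeping the last encoding as the feature vector gives the node embedding described in the text.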

Construction of support vector machine classification model
As all nodes in the heterogeneous graph are projected into a continuous vector space by using SAE, a support vector machine (SVM) classifier can work well with such continuous vector features to separate positive samples from negative ones by an optimal hyperplane. Given a constructed feature set $x \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ dimensions as a set of protein-target data, each sample $x_i$ of $x$ tagged with a class $y_i$ can be denoted as

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{id}), \quad y_i \in \{+1, -1\}$$

where $x_{ij}$ denotes the $j$-th column feature of $x_i$. As the optimal hyperplane in SVM needs to be generated to classify samples accurately based on the input training set, there are various kernels for different scenarios, such as linear, sigmoid, polynomial, and Gaussian radial basis function (RBF) kernels. Here, the RBF kernel is selected, and its definition is as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \left\| x_i - x_j \right\|^2\right)$$

where $\gamma$ is an important coefficient of the kernel function, i.e. the kernel bandwidth. In practice, a slack variable $\xi_i$ must be introduced to tolerate noise in the feature set, which loosens the constraints:

$$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

where $w$ and $b$ are the parameters adjusted by SVM for the decision margin, and $i$ ranges from 1 to $n$. To obtain the optimal result, the objective function is defined as follows:

$$\min_{w, b, \xi} \; \frac{1}{2} \left\| w \right\|^2 + C \sum_{i=1}^{n} \xi_i$$

where $C$ is the penalty constant for training error. In this study, SVM classifiers were implemented by using the libSVM tool.
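The RBF kernel and the soft-margin objective can be evaluated directly, which may help in reading the roles of γ, ξ, and C. The sketch below is an illustration only: it is not the libSVM implementation, the function names are hypothetical, and the objective is written for a linear decision function w · x + b for simplicity, whereas a kernel SVM works with the kernel in place of the inner product.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def soft_margin_objective(w, b, samples, labels, c):
    """0.5 * ||w||^2 + C * sum(xi_i) for a linear decision function.

    The slack xi_i = max(0, 1 - y_i (w . x_i + b)) is the smallest value
    that satisfies the relaxed margin constraint for sample i.
    """
    slack = [
        max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
        for x, y in zip(samples, labels)
    ]
    return 0.5 * sum(wj * wj for wj in w) + c * sum(slack)
```

A larger C penalises margin violations more heavily, while a larger γ makes the kernel value fall off faster with distance, so both are usually tuned together.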

Implementation
The FFANE is a two-part process that involves constructing an initial information fusion matrix and utilizing the Stacked Autoencoder (SAE) for node representation.
To construct the initial information fusion matrix, an alpha parameter must be established, which we evaluate between 0 and 1 with intervals of 0.125.The SAE is implemented using TensorFlow in Python, with a layered architecture consisting of N (input layer), 1024 (the 1st hidden layer), 512 (the 2nd hidden layer), and 128 (output layer).Based on our experience, the maximum number of epochs, batch size and learning rate of Adam optimizer are set to 100, 32, and 0.001, respectively.EarlyStopping is utilized with a patience of 30.Additionally, the mean squared error loss function is used.

Fig. 2 ROC curves for the S. cerevisiae Dataset by using SVM and SAE model on features fusion matrix with alpha at 0.375 via 5-Fold CV

Fig. 3 ROC curves for the H. sapiens Dataset by using SVM and SAE model on features fusion matrix with alpha at 0.625 via 5-Fold CV

Fig. 4 ROC curves for the H. pylori Dataset by using SVM and SAE model on features fusion matrix with alpha at 0.75 via 5-Fold CV

Fig. 5 ROC curves for the S. cerevisiae Dataset by using XGBoost and SAE model on features fusion matrix with alpha at 0.375 via 5-Fold CV

Fig. 6 ROC curves for the S. cerevisiae Dataset by using NB and SAE model on features fusion matrix with alpha at 0.375 via 5-Fold CV

Fig. 7 ROC curves for the S. cerevisiae Dataset by using RF and SAE model on features fusion matrix with alpha at 0.375 via 5-Fold CV

Fig. 8 ROC curves for the H. sapiens Dataset by using XGBoost and SAE model on features fusion matrix with alpha at 0.625 via 5-Fold CV

Fig. 10 ROC curves for the H. sapiens Dataset by using RF and SAE model on features fusion matrix with alpha at 0.625 via 5-Fold CV

Fig. 11 ROC curves for the H. pylori Dataset by using XGBoost and SAE model on features fusion matrix with alpha at 0.75 via 5-Fold CV

Fig. 12 ROC curves for the H. pylori Dataset by using NB and SAE model on features fusion matrix with alpha at 0.75 via 5-Fold CV

Fig. 14 Schematic of the architecture of a basic AE and a SAE. (A) A basic AE with one input layer, one hidden layer, and one output layer. (B) A SAE for node embedding with one input layer, two hidden layers, and one output layer

Table 1
Prediction results for the S. cerevisiae Dataset by using SVM and SAE model on features fusion matrix with different parameter α via 5-fold CV

Table 2
Prediction results for the H. sapiens Dataset by using SVM and SAE model on features fusion matrix with different parameter α via 5-fold CV

Table 3
Prediction results for the H. pylori Dataset by using SVM and SAE model on features fusion matrix with different parameter α

Table 4
Prediction results of 5-fold CV for the S. cerevisiae Dataset by using different classifiers

Table 5
Prediction results of 5-fold CV for the H. sapiens Dataset by using different classifiers

Table 6
Prediction results of 5-fold CV for the H. pylori Dataset by using different classifiers

Table 7
Performance comparison among the existing methods

Table 8
Results of statistical significance test on S. cerevisiae dataset

Table 9
Results of statistical significance test on H. pylori dataset

Table 10
Results of statistical significance test on H. sapiens dataset

Fig. 13 ROC curves for the H. pylori Dataset by using RF and SAE model on features fusion matrix with alpha at 0.75 via 5-Fold CV

proposed method combined with XGB and SVM is significantly superior to other methods.

Table 11
Detail of S. cerevisiae, H. sapiens, and H. pylori dataset